Abstract

In this paper we describe the TNO approach to
large-scale polarity classification of the Blog
TREC 2008 dataset. Our participation consists of
the submission of the 5 baseline runs provided by
NIST, for which we applied a multinomial kernel
machine operating on character n-gram
representations.^


1  Introduction 

The polarity task of Blog TREC 2008 consists of
retrieving and ranking, for each of a total of 150
topics (queries), the positive and negative
opinionated documents in the test collection. TREC
has made available 5 topic-relevance baseline
runs, to which polarity classification or opinion
finding techniques can be applied. This allows
participants to focus on one aspect of the
processing chain. In this contribution, we describe
the result of applying the TNO polarity
classification approach to these 5 baselines. We
discuss the results of our submissions in section 5
and present conclusions and lessons learned in
section 6. In the next three sections, we
describe the data, our feature representation, and
the general outline of our setup.

^This work was supported by the European IST
Programme Project FP6-0033812. This paper only reflects
the authors' views and funding agencies are not liable for
any use that may be made of the information contained
herein.


2  Data  and  pre-processing 

The TREC Blog06 collection, a 148-gigabyte
sample of the blogosphere, is the result of
an eleven-week crawl (December 2005 to
February 2006). Due to the automated crawling
process, the dataset contains not only legitimate
blog postings, but also spam, javascript,
homepages and RSS feed material. The data itself
consists of raw HTML, with a total of over 3.2
million documents. In order to train a
classifier on these class-labeled web pages, the
documents have to be cleaned up and converted to
plain text, which is far from trivial. Our
HTML to text conversion strategy consists of a
dedicated DOM parser that strips the
larger part of HTML tags and javascript code.

We ran this parser and the html2text
Python script^ in sequence: the output of our
dedicated parser was passed to html2text.py. While
this produced reasonably clean text, we found
that in many cases the output still
contained tags and programming constructs. We
surmise that our results are to a large extent
influenced by this imperfect data preprocessing.
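
The first stage of such a pipeline can be sketched as follows. This is a minimal illustration that uses only the Python standard library; it is not our dedicated DOM parser, and the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Text is kept only when we are outside script/style elements.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and script/style code, then normalize whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent block elements apart; the
    # split/join pass then collapses the resulting whitespace runs.
    return " ".join(" ".join(parser.parts).split())


page = ("<html><head><script>var x = 1;</script></head>"
        "<body><p>This car <b>really</b> rocks</p></body></html>")
clean = html_to_text(page)
```

Joining fragments with a space is a design choice of this sketch: it keeps text from neighbouring elements apart, at the cost of splitting a word that is interrupted by inline markup.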

^Available from http://www.aaronsw.com/2002/html2text/

3  Character  n-gram  representations

We opted for a character n-gram approach to the
polarity classification task. For every training
document, we generated character n-grams of 2 up
to 6 characters that transcend word boundaries.
That is, the transition between two consecutive
words, including the whitespace character, is
expressed as an n-gram. For the sentence
'This car really rocks', the subword character
bigrams and trigrams ('subgrams') are

th, hi, is, ca, ar, re, ea, al,
ll, ly, ro, oc, ck, ks, thi, his,          (1)
car, rea, eal, all, lly, roc, ock, cks.

A bigram and trigram representation that
spans word boundaries produces

th, hi, is, s#, #c, ca, ar, r#,
#r, re, ea, al, ll, ly, y#, #r,
ro, oc, ck, ks, thi, his, is#, s#c,        (2)
#ca, car, ar#, r#r,
#re, rea, eal, all, lly, ly#,
y#r, #ro, roc, ock, cks

with # a whitespace indicator.
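
Both representations can be generated with a few lines of Python. The sketch below is illustrative only: the function name char_ngrams and its parameters are hypothetical, and lowercasing the input is an assumption inferred from the example above.

```python
def char_ngrams(text, n_min=2, n_max=6, cross_boundaries=True, boundary="#"):
    """Return the character n-grams of `text`, grouped by increasing n."""
    words = text.lower().split()
    # Either treat the sentence as one string with '#' marking whitespace,
    # or extract n-grams from every word separately ('subgrams').
    units = [boundary.join(words)] if cross_boundaries else words
    grams = []
    for n in range(n_min, n_max + 1):
        for unit in units:
            grams.extend(unit[i:i + n] for i in range(len(unit) - n + 1))
    return grams


sentence = "This car really rocks"
subgrams = char_ngrams(sentence, 2, 3, cross_boundaries=False)
spanning = char_ngrams(sentence, 2, 3, cross_boundaries=True)
```

With these settings, subgrams contains the 24 n-grams listed first and spanning the 39 n-grams of the boundary-spanning representation.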

Every document is represented by a term
vector consisting of L1-normalized character
n-gram frequencies. In our recent work
(Raaijmakers and Kraaij, 2008; Wilson and
Raaijmakers, 2008; Raaijmakers et al., 2008) we
have found ample evidence for the
informativity of character n-grams. In (Raaijmakers et al.,
2008) we demonstrated in a large array of
experiments that character n-grams are more
informative than phonemes, prosody and word
n-grams. These
low-level features in fact implement a form of
attenuation (Eisner, 1996): a slight abstraction
of the underlying data that leads to the
formation of string equivalence classes. For instance,
words in a sentence will invariably share many
character n-grams. Since every unique character
n-gram in an utterance constitutes a separate
feature, this produces string classes, which is a
form of abstraction. Zhang and Lee (2006)
investigate similar subword representations, called
key substring group features. By compressing
substrings in a corpus in a trie (a prefix tree),
and labeling entire sets of distributionally
equivalent substrings with one group label, an
attenuation effect is obtained that proves very
beneficial for a number of text classification tasks.

Aside from attenuation effects, character
n-grams, especially those that contain word
boundaries, have additional benefits. Treating
word boundaries as characters captures
microphrasal information: short strings that express
the transition of one word to another. Stemming
occurs naturally within the set of initial
character n-grams of a word, where the suffix is left
out. In addition, some part-of-speech
information is captured. For example, the modals could,
would, should can be represented by the 4-gram
ould. Likewise, the set of adverbs ending in -ly
can be concisely represented by the 3-gram ly#.

4  Geodesic  kernels 


Recent work on document classification has
demonstrated the benefits of geodesic kernels
(Lafferty and Lebanon, 2005): support vector
machines that deploy geodesic distance
measures on L1-normalized data. L1 normalization
corresponds to normalizing the frequencies
of a bag of events $D = \{w_1, \ldots, w_n\}$, where $|w_i|$
is the frequency of event $w_i$ in $D$:

\[ \left\{ \frac{|w_1|}{\sum_{j=1}^{n} |w_j|}, \ldots, \frac{|w_n|}{\sum_{j=1}^{n} |w_j|} \right\} \tag{3} \]

L1-normalization of data entails an embedding
of this data into the multinomial manifold $\mathbb{P}^n$,
an infinitely differentiable, curved information
space that is isomorphic to the parameter space
of the multinomial distribution. This
information space has geodesic properties: it is locally
Euclidean and globally curved. Distances
between points are therefore best measured using
locally Euclidean and globally geodesic distance
measures. Technically, the multinomial
manifold $\mathbb{P}^n$ is isometric to the positive portion $S^n_+$ of
the n-sphere with radius 2 (Kass, 1989;
Lebanon, 2005):

\[ S^n_+ = \{ \psi \in \mathbb{R}^{n+1} : \|\psi\| = 2, \; \forall i \; \psi_i \geq 0 \} \tag{4} \]

by a diffeomorphism $F : \mathbb{P}^n \to S^n_+$,

\[ F(x) = (2\sqrt{x_1}, \ldots, 2\sqrt{x_{n+1}}) \tag{5} \]

This allows for measuring distance with a kernel
$K$ between two vectors $x, y$ in the space $S^n_+$:

\[ K(F(x), F(y)) \tag{6} \]

where the shortest path connecting these two
points in hyperspace is a segment of a
great circle.
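
As a small numerical illustration of L1 normalization followed by the square-root embedding, consider the sketch below. It is not our production code; the function names l1_normalize and embed_on_sphere are hypothetical, and a non-empty bag of events is assumed.

```python
import math
from collections import Counter


def l1_normalize(counts):
    """Map raw event frequencies to relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {event: freq / total for event, freq in counts.items()}


def embed_on_sphere(point):
    """Apply the coordinate-wise map x_i -> 2 * sqrt(x_i)."""
    return {event: 2.0 * math.sqrt(p) for event, p in point.items()}


# A toy bag of character bigrams; 'is' occurs twice.
counts = Counter(["th", "hi", "is", "is"])
x = l1_normalize(counts)
fx = embed_on_sphere(x)

# The squared coordinates of fx sum to 4 * sum(x), so the norm is 2.
norm = math.sqrt(sum(v * v for v in fx.values()))
```

Because the normalized frequencies sum to 1, the embedded vector always has Euclidean norm 2, which is why every document lands on the positive portion of the sphere with radius 2.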


Raaijmakers (2007) demonstrates that
multinomial kernels based on geodesic distance are
able to produce state of the art results for
sentiment polarity classification tasks.

In the experiments reported in this work,
we use a simple, hyperparameter-free
multinomial kernel, the negative geodesic kernel $K_{NGD}$
(Zhang et al., 2005):

\[ K_{NGD}(x, y) = -2 \arccos\left( \sum_{i=1}^{n+1} \sqrt{x_i y_i} \right) \tag{7} \]

Notice that this kernel combines a locally
Euclidean notion of similarity with a geodesic
notion of similarity: the vector product expresses
cosine similarity, and the inverse cosine the
measurement of distance along a curve.
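
A minimal sketch of such a kernel on sparse vectors is given below, assuming the kernel has the form -2 arccos(sum_i sqrt(x_i y_i)) on L1-normalized inputs. The function name ngd_kernel and the dictionary representation are hypothetical choices for this illustration.

```python
import math


def ngd_kernel(x, y):
    """Negative geodesic distance between two L1-normalized sparse vectors.

    Both arguments map n-grams to relative frequencies that sum to 1.
    """
    # Only n-grams present in both documents contribute to the sum.
    inner = sum(math.sqrt(p * y[k]) for k, p in x.items() if k in y)
    # Clamp against rounding errors before taking the inverse cosine.
    inner = min(1.0, max(0.0, inner))
    return -2.0 * math.acos(inner)


doc_a = {"th": 0.5, "hi": 0.5}
doc_b = {"th": 0.5, "is": 0.5}
doc_c = {"ro": 1.0}

same = ngd_kernel(doc_a, doc_a)      # identical documents
partial = ngd_kernel(doc_a, doc_b)   # one shared n-gram
disjoint = ngd_kernel(doc_a, doc_c)  # no shared n-grams
```

Under this form the kernel value ranges from -pi for documents without any shared n-gram up to 0 for identical distributions.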

Expanding the TREC data into character
n-grams greatly increases the data volume. Due
to memory constraints of our systems, we trained
on a random sample of only 16% of the training
data (which already amounts to over 250 megabytes
of training data).

4.1  Thresholding  decision  values 

Support vector machines output decision values
that are either discretized to binary classes (a
negative value produces a negative class label,
and a positive value a positive class label), or
converted to probabilities (e.g. Platt, 1999). We used the
raw decision values for ranking the various
positive and negative cases. We devised a simple
threshold estimator that, on the basis of class
distribution priors in the training data,
determines the optimal threshold above which
decision values should produce positive classes.
Algorithm 1 performs a one-parameter sweep,
fixing a decision value threshold that optimally
approximates the a priori class distributions in the
training data. We used this threshold to assign
classified documents to the positive and negative
classes, prior to ranking their respective decision
values.
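
One way to read this sweep is sketched below. It is a hedged illustration, not a reproduction of Algorithm 1: the function name estimate_threshold, the use of the observed decision values as candidate thresholds, and the use of >= for the positive side are all assumptions.

```python
def estimate_threshold(decision_values, positive_prior):
    """Pick the threshold whose share of positives is closest to the prior.

    decision_values: raw classifier outputs for a set of documents.
    positive_prior: fraction of positive documents in the training data.
    """
    n = len(decision_values)
    best_threshold, best_gap = None, float("inf")
    # Sweep over every observed decision value as a candidate threshold.
    for threshold in sorted(set(decision_values)):
        share = sum(1 for v in decision_values if v >= threshold) / n
        gap = abs(share - positive_prior)
        if gap < best_gap:
            best_threshold, best_gap = threshold, gap
    return best_threshold


values = [-2.0, -1.0, -0.5, 0.3, 1.2]
threshold = estimate_threshold(values, positive_prior=0.4)
positives = sorted((v for v in values if v >= threshold), reverse=True)
negatives = sorted(v for v in values if v < threshold)
```

In this toy example two of the five documents end up in the positive class, matching the assumed prior of 0.4, and each class is then ranked by its decision values.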

5  Results 

In figures 1 and 2, the results for positive and
negative queries are displayed by plotting the
difference between the produced MAP and R-PREC
values^ and the reference values. As can be
seen, the runs for the positive queries produce
well above median scores for both MAP and
R-PREC. Averaged over the 5 baseline runs, 62.3%
of the positive queries score at or above the
reference median average precision.
For the R-PREC scores for positive queries
this portion is on average 70.5%. The R-PREC
scores produced by the 5 positive baseline runs
were all significantly^ better than the median
R-PREC reference scores. The average
difference between produced R-PREC and reference
R-PREC was +12.1%. For the negative queries,
on average 24.7% of all MAP scores produced
were equal to or above the reference MAP
values. For R-PREC, a much higher proportion,
on average 57.8% of the scores, was equal to or above
median reference R-PREC. The difference between
produced R-PREC and reference R-PREC, averaged
over all 5 runs, was a rather small -1.7%. In 4 out of
5 runs, this difference was significant.

The percentages of deviations are listed in
table 1, as well as the results of the Wilcoxon
signed rank test applied to the R-PREC results.


Task      % MAP    % R-PREC    MAP     R-PREC    W

POSITIVE QUERIES
base1     +66.9    +75.7       26.2    21.3      +14.7
base2     +53.4    +60.8       19.4    15.4      +7.9
base3     +64.2    +73         24.1    19.4      +12.5
base4     +64.2    +71.6       23.8    19        +12.3
base5     +62.8    +71.6       24.8    19.7      +13.2
Average   +62.3    +70.5       16.2    11.5      +12.1

NEGATIVE QUERIES
base1     +22.7    +61         8.7     4.5       -1.3
base2     +14.9    +49.7       6.7     3.5       -3.3
base3     +27.7    +58.9       7.7     4         -2.3
base4     +31.2    +59.6       8.3     4.5       -1.6
base5     +27      +59.6       8.4     4         =
Average   +24.7    +57.8       10      6.7       -1.7

Table 1: Percentage of queries with MAP scores
above/below (+/-) median average precision;
percentage of queries with R-PREC scores above/below
median R-PREC; average MAP and average
R-PREC scores for the 5 polarity baselines; Wilcoxon
significance of the difference of the obtained score
with the reference score (p < .5), as well as the
difference of the average obtained score with the average
reference score.


^Mean  Average  Precision  and  Precision  at  R  (with  R 
the  number  of  relevant  documents). 

^All significance results were computed with the
non-parametric Wilcoxon signed rank test, with p < .5.

Figure 1: Difference of TNO produced MAP and
R-PREC values with the TREC reference values for
positive queries (queries sorted by descending
performance).


6  Conclusions 

In this paper, we presented the TNO approach
to polarity classification and ranking of the Blog
TREC 2008 data. For 5 baseline runs, we
applied a geodesic kernel to character n-gram
representations. We trained our system on a
relatively small portion of 16% of the total available
training data. Results show that our system
performs well above median for positive queries.
For negative queries, results are below median
in 4 out of 5 runs, albeit by a small (but
significant) margin. As a lesson learned, in
future TREC participation, we will invest more
time in thorough data cleaning prior to
classifier training and testing.

Figure 2: Difference of TNO produced MAP and
R-PREC values with the TREC reference values for
negative queries (queries sorted by descending
performance).